Abstract: The maximal information coefficient (MIC), which
measures the amount of dependence between two variables, is
able to detect both linear and non-linear associations. However,
its computational cost grows rapidly as a function of the dataset
size. In this paper, we develop a computationally efficient
approximation to the MIC that replaces its dynamic programming
step with a much simpler technique based on uniform partitioning
of the data grid. A variety of experiments demonstrate the quality
of our approximation.


I. Introduction 

One of the challenging issues for statisticians and computer
scientists is dealing with large datasets containing hundreds
of variables, some of which may have interesting but
unexplored relationships with each other. Massive datasets of
this kind arise in areas such as social networks, astronomy,
genomics, medical records, and political science. Hence, methods
that help discover these relationships are of considerable interest.

Measuring the amount of dependence between two variables
has been extensively studied in the literature, and several
methods have been proposed for it. We review some of them in
the following. In [1], the author suggested seven properties
to be satisfied by any measure $\phi(x, y)$ used for determining
the amount of dependence between $x$ and $y$. These properties,
known as Rényi's axioms, are:

• $\phi(x, y)$ is defined for any two random variables $x$ and $y$,
neither of which is constant with probability 1.

• $0 \le \phi(x, y) \le 1$.

• $\phi(x, y) = 0$ if and only if $x$ and $y$ are independent.

• $\phi(x, y) = 1$ if there is an arbitrary functional dependency
between $x$ and $y$, in other words, if $y = f(x)$ or $x = g(y)$
where $f(\cdot)$ and $g(\cdot)$ are Borel-measurable functions.

• $\phi(x, y) = \phi(y, x)$.

• If $f(\cdot)$ and $g(\cdot)$ are strictly monotonic functions, then
$\phi(x, y) = \phi(f(x), g(y))$.

• If $x$ and $y$ are jointly Gaussian random variables, then
$\phi(x, y) = |\mathrm{PCC}(x, y)|$, where PCC is the Pearson
correlation coefficient.
The Pearson correlation coefficient (PCC) is the best-known
dependency measure. However, it captures only linear associations
between two variables and is unable to detect non-linear ones.
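As a small illustration of this limitation, the following sketch computes the sample PCC for a linear and for a parabolic relationship. The function name `pearson` and the test data are illustrative choices, not part of the original paper.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Illustrative data: a symmetric grid of x-values on [-1, 1].
xs = [i / 50 - 1 for i in range(101)]
linear = [2 * x + 1 for x in xs]
parabola = [x * x for x in xs]

print(pearson(xs, linear))    # close to 1: the linear dependence is detected
print(pearson(xs, parabola))  # close to 0 although y is a function of x
```

The parabola is a deterministic function of $x$, yet its PCC is essentially zero on a symmetric domain, which is the kind of association the measures reviewed below are meant to capture.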

As another measure of dependency, the correlation ratio of
random variable $y$ on random variable $x$ (defined if $\sigma^2(y)$
exists and $\sigma(y) > 0$) is

\[ \Theta(y) = \frac{\sigma(\mathbb{E}[y \mid x])}{\sigma(y)}. \tag{1} \]

It is easy to show that $0 \le \Theta(y) \le 1$, where $\Theta(y) = 1$ if and
only if $y = f(x)$ for a Borel-measurable function $f(x)$,
and $\Theta(y) = 0$ if $x$ and $y$ are independent. An alternative
formula for the correlation ratio is

\[ \Theta(y) = \sup_{g} |\mathrm{PCC}(y, g(x))|. \tag{2} \]

This alternative formula leads to another measure of dependency
called the maximal correlation:

\[ S(x, y) = \sup_{f, g} \mathrm{PCC}(f(x), g(y)), \tag{3} \]

where $f(\cdot)$ and $g(\cdot)$ are Borel-measurable functions. The
author in [1] has shown that $S(x, y) = 0$ if and only if $x$ and $y$
are independent. Furthermore, if there is an arbitrary functional
relationship between $x$ and $y$, then $S(x, y) = 1$. The alternating
conditional expectation (ACE) algorithm was introduced to find
the optimal transformations $f$ and $g$.

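The Spearman correlation coefficient is defined similarly
to the PCC; however, it is computed between the two ranked
variables. By ranked variables we mean replacing each data
point by its rank (or the average rank for equal sample
points) in ascending order. Therefore, if $\hat{x}_i$ and $\hat{y}_i$ denote
the ranked versions of $x_i$ and $y_i$, the Spearman correlation
coefficient is the PCC of the ranks:

\[ \rho = \frac{\sum_i (\hat{x}_i - \bar{\hat{x}})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_i (\hat{x}_i - \bar{\hat{x}})^2 \sum_i (\hat{y}_i - \bar{\hat{y}})^2}}, \tag{4} \]

where $\bar{\hat{x}}$ and $\bar{\hat{y}}$ denote the mean ranks.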
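The rank-then-correlate construction can be sketched as follows. This is a minimal illustration with hypothetical helper names (`average_ranks`, `pearson`, `spearman`); ties receive their average rank, as described above.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman coefficient: the PCC of the two rank sequences."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Illustrative data: a monotone but strongly non-linear relationship.
xs = [0.5 * i for i in range(20)]
ys = [math.exp(x) for x in xs]
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(spearman(xs, ys), pearson(xs, ys))
```

For a strictly increasing relationship the ranks coincide, so the Spearman coefficient reaches 1 even though the PCC of the raw values does not.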

Covariance and linear correlation have also been expressed in
terms of principal components and generalized to variables
distributed along a curve; the resulting measures are estimated
using principal curves.

Mutual information is another measure that can be
used to quantify the dependency between two variables,
since it satisfies some common properties of other dependency
measures. As an example, $I(x, y) = 0$ if and only if $x$ and $y$
are independent. The authors in [9] have used kernel density
estimation of probability density functions in order to estimate
the mutual information between two variables. In [10], a
method of mutual information estimation based on binning
and estimating entropy from $k$-nearest neighbors is proposed.

The MIC [11] was recently proposed for quantifying the
dependency between two random variables. It is based on binning
the dataset, using a dynamic programming technique, to compute
the mutual information between the variables. It has two
main properties that make it superior to
the aforementioned measures. First, it has generality, meaning
that if the sample size is large enough, it is able to detect
different kinds of associations rather than specific types.
Second, it is an equitable measure, meaning that it gives similar
scores to equally noisy associations no matter what type the
association is.

One of the problems with the MIC is that its
computational cost grows rapidly as a function of the dataset
size. Since this computational cost may become infeasible, the
authors in [11] apply a heuristic so as not to compute
the mutual information for all possible grids. This heuristic
may result in finding only a local maximum.

In this paper, we develop a computationally efficient
approximation to the MIC. This approximation replaces
the dynamic programming step used in the computation
of the MIC with a very efficient technique, namely uniform
binning of the data. We show that our proposed method is able to
detect both functional and non-functional associations between
different variables, similarly to the MIC but more efficiently.
In addition, it performs better at recognizing
independence between different variables.

The rest of this paper is organized as follows. In
Section II, we review the MIC and the algorithm used to
compute it from [11]. In Section III, we introduce our new
measure of dependency, which is a modification of the MIC. We
present simulation results in Section IV. Finally, Section V
concludes the paper.


II. The Maximal Information Coefficient (MIC) 


A. MIC Definition and Properties 

For any finite dataset $D$ containing ordered pairs of
two random variables, one can partition the first elements, i.e.,
the $x$-values of these pairs, into $\ell_x$ bins and similarly partition
the second elements, or $y$-values, of these pairs into $\ell_y$ bins. As a
result of this partitioning, we obtain an $\ell_x$-by-$\ell_y$ grid $G$.

Every cell of this grid may or may not contain some sample
points from the set $D$. This grid induces a probability
distribution on the cells of $G$, where the probability
of each cell equals the fraction of sample points located
in that cell. That is to say,

\[ p_{ij} = \frac{|D_{ij}|}{|D|}, \tag{5} \]

where $p_{ij}$ denotes the probability corresponding to the cell
located at the $i$-th row and the $j$-th column and $|D_{ij}|$ denotes
the number of sample points falling into the $i$-th row and the
$j$-th column (see Figure 1 for a graphical view of the grid $G$).
Each choice of $\ell_x$-by-$\ell_y$ grid induces a new probability
distribution and hence results in a
different mutual information between the two variables.

Fig. 1. Partitioning of dataset $D$ into $\ell_x$ columns and $\ell_y$ rows. $D_{ij}$ denotes
the set of sample points located in the $i$-th row and the $j$-th column.
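The grid-induced distribution of equation (5) and the resulting mutual information can be sketched in a few lines. The function name `grid_mutual_information` and the cut-point representation of the grid are assumptions made for this illustration; natural logarithms are used.

```python
import math
from bisect import bisect_right
from collections import Counter

def grid_mutual_information(points, x_cuts, y_cuts):
    """Mutual information (in nats) of the distribution a grid induces on points.

    x_cuts and y_cuts are sorted interior cut positions, so the grid has
    len(x_cuts) + 1 columns and len(y_cuts) + 1 rows.
    """
    n = len(points)
    cells, cols, rows = Counter(), Counter(), Counter()
    for x, y in points:
        j = bisect_right(x_cuts, x)  # column index of the point
        i = bisect_right(y_cuts, y)  # row index of the point
        cells[(i, j)] += 1
        rows[i] += 1
        cols[j] += 1
    mi = 0.0
    for (i, j), count in cells.items():
        p_ij = count / n  # cell probability as in equation (5)
        # p_ij / (p_i * p_j) written in terms of counts
        mi += p_ij * math.log(p_ij * n * n / (rows[i] * cols[j]))
    return mi

# Illustrative data on a 2-by-2 grid with cuts at 0.5.
diagonal = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.9)]
spread = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
print(grid_mutual_information(diagonal, [0.5], [0.5]))  # log(2)
print(grid_mutual_information(spread, [0.5], [0.5]))    # 0.0
```

Moving the cut points changes the cell counts and therefore the mutual information, which is why the MIC searches over grids.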

Let $I^*_{D,\ell_x,\ell_y}(P;Q) = \max_G I_{D|G}(P;Q)$ be the largest
possible mutual information achievable by an $\ell_x$-by-$\ell_y$ grid $G$
on a set $D$ of sample points, where $P$ and $Q$ are the partitions
of the X-axis and Y-axis of grid $G$, respectively. In order to
have a fair comparison among different grids, the computed
values of mutual information should be normalized. Since
$I(P;Q) = H(Q) - H(Q|P) = H(P) - H(P|Q)$, the mutual information
is at most $\min(H(P), H(Q)) \le \log(\min(\ell_x, \ell_y))$, so we divide
$I^*_{D,\ell_x,\ell_y}(P;Q)$ by $\log(\min(\ell_x, \ell_y))$. Therefore, we have

\[ 0 \le \frac{I^*_{D,\ell_x,\ell_y}(P;Q)}{\log(\min(\ell_x, \ell_y))} \le 1. \tag{6} \]


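This inequality motivates the definition of the MIC as a
measure of dependency between two variables. For a dataset
$D$ containing $n$ samples of two variables, we have

\[ \mathrm{MIC}(D) = \max_{\ell_x \ell_y < B(n)} \frac{I^*_{D,\ell_x,\ell_y}(P;Q)}{\log(\min(\ell_x, \ell_y))}, \tag{7} \]

where $B(n) = n^{0.6}$ [11] or, more generally, $\omega(1) < B(n) \le O(n^{1-\epsilon})$
for $0 < \epsilon < 1$.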

According to this definition, the MIC has the following
properties:

• $0 \le \mathrm{MIC}(D) \le 1$.

• $\mathrm{MIC}(x, y) = \mathrm{MIC}(y, x)$.

• It is invariant under order-preserving transformations
applied to the dataset $D$.

• It is not invariant under rotation of the coordinate axes,
e.g., if $y = x$, then $\mathrm{MIC}(D) = 1$. However, after a 45°
clockwise rotation of the coordinate axes, instead of $y = x$
we have $y = 0$ and hence $\mathrm{MIC}(D) = 0$.


B. MIC Algorithm 

Although the algorithm for computing the MIC is fully
described in [11], here we only review the OptimizeXAxis
algorithm, which is used to compute the highest mutual
information achievable by an $\ell_x$-by-$\ell_y$ grid. Any $\ell_x$-by-$\ell_y$ grid
imposes two sets of partitions, on the x-values (columns of the grid)
and on the y-values (rows of the grid). We denote the columns of the grid
by $\langle c_1, c_2, \ldots, c_{\ell_x} \rangle$, where $c_i$ denotes the endpoint (largest
x-value) of the $i$-th column.

Fig. 2. OptimizeXAxis [11] considers only consecutive points falling into the
same row and draws partitions between them. A set of consecutive points
falling into the same row is called a clump.

Since $I(P;Q)$ is upper-bounded by $H(P)$ and $H(Q)$, in
order to maximize it, one can equipartition either the Y or the X
axis, i.e., impose a discrete uniform distribution on either $Q$ or
$P$. Without loss of generality, we consider the version of the
algorithm that equipartitions the Y-axis. However, both cases
(equipartitioning either the X or the Y axis) should be checked
separately for each $\ell_x$-by-$\ell_y$ grid, and the maximum resulting
mutual information should be chosen.

Let $H(P)$ denote the entropy of the distribution imposed by
$m$ sample points ($m \le |D| = n$) on the partition of the
X-axis. Similarly, let $H(Q)$ denote the entropy of the distribution
imposed by $m$ sample points ($m \le |D| = n$) on the partition
of the Y-axis. Since we have assumed that the Y-axis is
equipartitioned, $H(Q)$ is constant and equal to $\log(|Q|)$. Finally,
let $H(P, Q)$ denote the entropy of the distribution imposed by
$m$ sample points ($m \le |D| = n$) on the cells of grid $G$,
which has X-axis partition $P$ and Y-axis partition $Q$. Since
$I(P;Q) = H(Q) - H(Q|P)$ and we have already maximized
$H(Q)$ by equipartitioning the Y-axis, to achieve the highest
mutual information we have to minimize $H(Q|P)$. This
is done by the OptimizeXAxis algorithm [11].

An alternative formula for the mutual information is
$I(P;Q) = H(Q) + H(P) - H(P, Q)$. Since $H(Q)$ is constant,
OptimizeXAxis only needs to maximize $H(P) - H(P, Q)$.
The following theorem [11] is the key to solving this problem.

Theorem II.1. For a dataset $D$ of size $n$ and a fixed row
partition $Q$, and for every $m, l \in \mathbb{N}$, define
$F(m, l) = \max_{P} \{H(P) - H(P, Q)\}$, where the maximum is over
partitions $P$ of the first $m$ points $D(1:m)$ with $|P| = l$. Then
for $l > 1$ and $1 < m \le n$ we have the following recursive equation:

\[ F(m, l) = \max_{1 \le i < m} \left\{ \frac{i}{m} F(i, l-1) - \frac{m-i}{m} H(\langle i, m \rangle, Q) \right\}. \tag{8} \]

Proof of Theorem II.1. See Proposition 3.2 in [11]. ■

OptimizeXAxis uses a dynamic programming technique
motivated by Theorem II.1. It computes $F(n, l)$, which corresponds
to the desired partition of dataset $D$ (which has $n$ sample points)


Algorithm 1 OptimizeXAxis($D$, $Q$, $\ell_x$) [11]

Require: $D$ is a set of ordered pairs sorted in increasing order
by x-values
Require: $Q$ is a Y-axis partition of $D$
Require: $\ell_x$ is an integer greater than 1
Ensure: Returns a list of scores $(I_2, \ldots, I_{\ell_x})$ such that each
$I_l$ is the maximum value of $I(P;Q)$ over all partitions $P$ of
size $l$

1: $\langle c_0, \ldots, c_k \rangle \leftarrow$ GetClumpsPartition($D$, $Q$)
2:
3: Find the optimal partitions of size 2
4: for $t = 2$ to $k$ do
5:     Find $s \in \{1, \ldots, t\}$ maximizing $H(\langle c_s, c_t \rangle) - H(\langle c_s, c_t \rangle, Q)$.
6:     $P_{t,2} \leftarrow \langle c_s, c_t \rangle$
7:     $I_{t,2} \leftarrow H(Q) + H(P_{t,2}) - H(P_{t,2}, Q)$
8: end for
9:
10: Inductively build the rest of the table of optimal partitions
11: for $l = 3$ to $\ell_x$ do
12:     for $t = 2$ to $k$ do
13:         Find $s \in \{1, \ldots, t\}$ maximizing
            $F(s, t, l) := \frac{c_s}{c_t}\left(I_{s,l-1} - H(Q)\right) + \sum_{i=1}^{|Q|} \frac{\#_{i,l}}{c_t} \log \frac{\#_{i,l}}{\#_{*,l}}$,
            where $\#_{*,j}$ is the number of points in the $j$-th column
            of $P_{s,l-1} \cup c_t$ and $\#_{i,j}$ is the number of points in the
            $j$-th column of $P_{s,l-1} \cup c_t$ that fall in the $i$-th row of $Q$
14:         $P_{t,l} \leftarrow P_{s,l-1} \cup c_t$
15:         $I_{t,l} \leftarrow H(Q) + H(P_{t,l}) - H(P_{t,l}, Q)$
16:     end for
17: end for
18: return $(I_{k,2}, \ldots, I_{k,\ell_x})$


having $l$ columns that impose partition $P$ over the X-axis. In order
to minimize $H(Q|P)$, OptimizeXAxis considers only
consecutive points falling into the same row and draws partitions
between them. A set of consecutive points falling into the
same row is called a clump (see Figure 2 for a graphical view of a
clump). In Algorithm 1, the GetClumpsPartition subroutine is
responsible for finding and partitioning the clumps. Moreover,
$P_{t,l}$ is an optimal partition of size $l$ for the first $t$ clumps.
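The notion of a clump can be illustrated with a short sketch. It assumes the points are already sorted by x-value and that each point has been assigned a row index; the function name `clumps` is hypothetical, and ties in x-values, which the original GetClumpsPartition subroutine has to handle, are ignored here.

```python
def clumps(rows):
    """Group consecutive x-sorted points that share a row into clumps.

    rows[i] is the row index of the i-th point in x-sorted order.
    Returns a list of (start, end) index pairs, with end exclusive.
    """
    out = []
    start = 0
    for i in range(1, len(rows) + 1):
        # Close the current clump at the end of the data or when the row changes.
        if i == len(rows) or rows[i] != rows[start]:
            out.append((start, i))
            start = i
    return out

# Illustrative row sequences for nine x-sorted points and three rows.
monotone_rows = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # e.g. an increasing function
scattered_rows = [0, 2, 1, 0, 2, 1, 0, 2, 1]  # e.g. noisy or random data

print(clumps(monotone_rows))        # [(0, 3), (3, 6), (6, 9)]
print(len(clumps(scattered_rows)))  # 9
```

A functional relationship produces few clumps, while scattered data produces close to one clump per point, which is what drives the runtime of the dynamic program.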

III. The Uniform-MIC (U-MIC) 

A. Noiseless Setting 

The major drawback of Algorithm 1 is its computational
complexity. If there exist $k$ clumps in the given partition
of an $\ell_x$-by-$\ell_y$ grid, the runtime of this algorithm is
$O(k^2 \ell_x \ell_y)$. If there is a functional association between the
two variables, the number of clumps in the corresponding
grid is small. However, for noisy or random datasets the
number of clumps is typically very large,
and hence the computational cost of Algorithm 1
is high.




































Furthermore, because of this problem, this algorithm cannot
easily be generalized to detect associations between more
than two variables. As an example, if we want to detect
whether or not three variables are related to each other, we
may write the formula for the generalized mutual information
as

\[ I(P; Q; R) = H(P) + H(Q) + H(R) - H(P, Q) - H(P, R) - H(Q, R) + H(P, Q, R). \tag{9} \]

Hence, intuitively and as in the case of two random variables,
in order to maximize the generalized mutual information,
we have to equipartition one axis to maximize the entropy.
Nevertheless, we should partition the two other axes with
respect to the positions of the clumps in them. If we equipartition
the first axis and there exist $k_1$ clumps in the second axis
and $k_2$ clumps in the third axis, then the runtime of this
algorithm would be $O(k_1^2 k_2^2 \ell_x \ell_y \ell_z)$, where $\ell_x, \ell_y, \ell_z$ are the
sizes of the partitions. This runtime is not acceptable for large
datasets. Therefore, we have to modify the algorithm in order
to decrease its runtime and, as a result, make it generalizable
to higher dimensions.
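For a fixed three-dimensional binning, the generalized mutual information of equation (9) is a simple combination of empirical entropies. The sketch below assumes each sample has already been mapped to bin labels along the three axes; the names `entropy` and `generalized_mi` and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical Shannon entropy (in nats) of a sequence of hashable labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def generalized_mi(p, q, r):
    """Equation (9) evaluated on three equally long sequences of bin labels."""
    pq = list(zip(p, q))
    pr = list(zip(p, r))
    qr = list(zip(q, r))
    pqr = list(zip(p, q, r))
    return (entropy(p) + entropy(q) + entropy(r)
            - entropy(pq) - entropy(pr) - entropy(qr)
            + entropy(pqr))

# Toy data: three identical binary labelings (fully dependent variables).
same = [0, 0, 1, 1]
print(generalized_mi(same, same, same))  # log(2)

# Toy data: r is the XOR of two independent fair bits p and q.
p = [0, 0, 1, 1]
q = [0, 1, 0, 1]
r = [a ^ b for a, b in zip(p, q)]
print(generalized_mi(p, q, r))           # -log(2)
```

Unlike the two-variable mutual information, this quantity can be negative, as the XOR example shows, so any normalization for three variables would need to account for that.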

The algorithm we propose here to replace Algorithm 1
is uniform partitioning (Algorithm 2). Let $y_{\min} = \min_i y_i$,
$y_{\max} = \max_i y_i$, and similarly $x_{\min} = \min_i x_i$
and $x_{\max} = \max_i x_i$. We then partition both the X and Y
axes such that all the columns have length $(x_{\max} - x_{\min})/\ell_x$ and,
similarly, all the rows have length $(y_{\max} - y_{\min})/\ell_y$. We call the new
measure, which is derived by replacing Algorithm 1 with
Algorithm 2, the U-MIC (Uniform Maximal Information
Coefficient). In the following we prove that the U-MIC
approaches 1 as the sample size grows when there exists
a functional association between the two variables (with finite
derivative). Without loss of generality, we carry out all the proofs
for the case $(x, y) \in [0, 1] \times [0, 1]$. These proofs can easily be
generalized to other cases.
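A minimal sketch of the uniform-partitioning idea is given below. It is not the authors' implementation: the function names `uniform_grid_mi` and `u_mic`, the exhaustive search over grid sizes with $\ell_x \ell_y < B(n)$, and the choice $B(n) = n^{0.6}$ (borrowed from the MIC setting) are assumptions made for illustration.

```python
import math
from collections import Counter

def uniform_grid_mi(points, lx, ly):
    """Normalized mutual information of a uniform lx-by-ly grid over the data range."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)

    def bin_index(v, lo, hi, k):
        if hi == lo:  # degenerate axis: every point falls in one bin
            return 0
        return min(int((v - lo) / (hi - lo) * k), k - 1)

    cells, cols, rows = Counter(), Counter(), Counter()
    for x, y in points:
        c = bin_index(x, xmin, xmax, lx)
        r = bin_index(y, ymin, ymax, ly)
        cells[(c, r)] += 1
        cols[c] += 1
        rows[r] += 1

    def entropy(counter):
        return -sum(k / n * math.log(k / n) for k in counter.values())

    mi = entropy(cols) + entropy(rows) - entropy(cells)
    return mi / math.log(min(lx, ly))

def u_mic(points, exponent=0.6):
    """Maximum normalized score over all uniform grids with lx * ly < n ** exponent."""
    budget = len(points) ** exponent  # assumed B(n) = n^0.6
    best = 0.0
    lx = 2
    while lx * 2 < budget:
        ly = 2
        while lx * ly < budget:
            best = max(best, uniform_grid_mi(points, lx, ly))
            ly += 1
        lx += 1
    return best

# Illustrative data: a noiseless line and a full 15-by-15 lattice (independent axes).
line = [(i / 199, i / 199) for i in range(200)]
lattice = [(i, j) for i in range(15) for j in range(15)]
print(u_mic(line))     # close to 1
print(u_mic(lattice))  # close to 0
```

Each grid costs a single pass over the data, so the total work depends only on $n$ and the number of admissible grid sizes, not on the number of clumps.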


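Algorithm 2 UniformPartition($\ell_x$, $\ell_y$)

Require: Dataset $D$
Require: $\ell_x$ and $\ell_y$ are integers greater than 1
Ensure: Returns a score $I^*$, which is the normalized value of $I(P;Q)$
where $P$ and $Q$ are the distributions from uniform partitioning of
both axes.

1: $P \leftarrow$ uniform partition of the X-axis by $\ell_x$ columns, each of
length $(x_{\max} - x_{\min})/\ell_x$
2: $Q \leftarrow$ uniform partition of the Y-axis by $\ell_y$ rows, each of
length $(y_{\max} - y_{\min})/\ell_y$
3: $I^* \leftarrow \frac{H(P) + H(Q) - H(P, Q)}{\log(\min(\ell_x, \ell_y))}$
4: return $I^*$


Proposition III.1. If $D = \{(x_i, y_i)\}_{i=1}^{n}$ where $y_i = h(x_i)$
and $|h'(x)| < \infty$, then $\lim_{n \to \infty} \text{U-MIC}(D) = 1$.

Proof of Proposition III.1. We denote by $g_h(a)$ the sub-level
function of the function $h(\cdot)$, i.e.,

\[ g_h(a) = \lambda(\{x : h(x) \le a\}), \tag{10} \]

where $\lambda(T)$ denotes the fraction of sample points in the set
$T$. Consequently,

\[ g_h(a) = F_y(a) = \mathbb{P}(y \le a) = \mathbb{P}(h(x) \le a), \tag{11} \]

where $F_y(\cdot)$ denotes the cumulative distribution function
(CDF) and $\mathbb{P}(\cdot)$ denotes the probability function. Using this
notation and assuming that we uniformly partition the Y-axis into
$\ell_y$ rows, we can write the entropy of $Q$, which is the uniform
partition of the Y-axis, as

\begin{align}
H(Q) &= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}(Q = i) \log(\mathbb{P}(Q = i)) \tag{12} \\
&= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y}\right) \log \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y}\right) \\
&= -\sum_{i=0}^{\ell_y - 1} \left[g_h\left(\tfrac{i+1}{\ell_y}\right) - g_h\left(\tfrac{i}{\ell_y}\right)\right] \log \left[g_h\left(\tfrac{i+1}{\ell_y}\right) - g_h\left(\tfrac{i}{\ell_y}\right)\right] \\
&= -\sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} g_h'(a_i) \log\left(\frac{1}{\ell_y} g_h'(a_i)\right) \\
&= -\sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} g_h'(a_i) \log(g_h'(a_i)) - \sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} g_h'(a_i) \log\left(\frac{1}{\ell_y}\right),
\end{align}

where $\frac{i}{\ell_y} \le a_i \le \frac{i+1}{\ell_y}$ for each $i$ ($0 \le i \le \ell_y - 1$) is
derived according to the mean value theorem. If, without loss
of generality, we assume that $\min(\ell_x, \ell_y) = \ell_y$, then we can
write

\[ \frac{H(Q)}{\log(\ell_y)} = -\frac{1}{\log(\ell_y)} \sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} g_h'(a_i) \log(g_h'(a_i)) + \sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} g_h'(a_i). \tag{13} \]

As a result, in the asymptotic setting we can write

\[ \lim_{\ell_y \to \infty} \frac{H(Q)}{\log(\ell_y)} = \lim_{\ell_y \to \infty} \sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} g_h'(a_i) = \int_0^1 g_h'(a)\, da = 1, \tag{14} \]

where the last equality holds since $\lim_{\ell_y \to \infty} \sum_{i} \frac{1}{\ell_y} g_h'(a_i)$ is
the Riemann integral of the function $g_h'(a)$. If we assume
that $|h'(\cdot)| \le c$, then according to the mean value theorem
we have

\[ |h(x_1) - h(x_2)| \le c\, |x_1 - x_2|. \tag{15} \]

Equation (15) states that for a particular column of the X-axis
partition, the curve of the function passes through at most $c + 1$
cells of that column. We use this fact in upper-bounding
$H(Q|P)$. Similar to (12) and (13), we have

\begin{align}
H(Q|P = k) &= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}(Q = i | P = k) \log(\mathbb{P}(Q = i | P = k)) \tag{16} \\
&= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y} \,\Big|\, P = k\right) \log \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y} \,\Big|\, P = k\right) \\
&= -\sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} f_{y|x}(a_i | P = k) \log\left(\frac{1}{\ell_y} f_{y|x}(a_i | P = k)\right) \\
&= -\sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} f_{y|x}(a_i | P = k) \log(f_{y|x}(a_i | P = k)) - \sum_{i=0}^{\ell_y - 1} \frac{1}{\ell_y} f_{y|x}(a_i | P = k) \log\left(\frac{1}{\ell_y}\right),
\end{align}

where $f_{y|x}$ denotes the conditional probability density
function. Because of equation (15), we can simplify (16) as

\[ H(Q|P = k) = -\sum_{i=j_1}^{j_{c+1}} \frac{1}{\ell_y} f_{y|x}(a_i | P = k) \log(f_{y|x}(a_i | P = k)) - \sum_{i=j_1}^{j_{c+1}} \frac{1}{\ell_y} f_{y|x}(a_i | P = k) \log\left(\frac{1}{\ell_y}\right). \tag{17} \]

If we define $k^* = \arg\max_k H(Q|P = k)$, then since
$H(Q|P) = \sum_k \mathbb{P}(P = k) H(Q|P = k)$, we can write

\[ H(Q|P) \le -\sum_{i=j_1}^{j_{c+1}} \frac{1}{\ell_y} f_{y|x}(a_i | P = k^*) \log(f_{y|x}(a_i | P = k^*)) - \sum_{i=j_1}^{j_{c+1}} \frac{1}{\ell_y} f_{y|x}(a_i | P = k^*) \log\left(\frac{1}{\ell_y}\right), \tag{18} \]

and hence

\[ \lim_{\ell_y \to \infty} \frac{H(Q|P)}{\log(\ell_y)} \le \lim_{\ell_y \to \infty} \sum_{i=j_1}^{j_{c+1}} \frac{1}{\ell_y} f_{y|x}(a_i | P = k^*) = 0, \tag{19} \]

where the last equality holds since $\frac{1}{\ell_y} \to 0$ but $c < \infty$. As a result,

\[ \lim_{\ell_y \to \infty} \frac{I(P;Q)}{\log(\min\{\ell_x, \ell_y\})} = \lim_{\ell_y \to \infty} \frac{H(Q) - H(Q|P)}{\log(\min\{\ell_x, \ell_y\})} = 1. \tag{20} \]

■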

If $x$ and $y$ are independent, then according to the following
proposition we have U-MIC$(D) = 0$.

Proposition III.2. If $D = \{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \perp y_i$ for
$1 \le i \le n$, then U-MIC$(D) = 0$.

Proof of Proposition III.2. The line of reasoning is
straightforward and similar to the proof of Proposition III.1.
Since $x$ and $y$ are independent of each other, we can write

\begin{align}
H(Q) &= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}(Q = i) \log(\mathbb{P}(Q = i)) \tag{21} \\
&= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y}\right) \log \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y}\right) \\
&= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y} \,\Big|\, P = k\right) \log \mathbb{P}\left(\tfrac{i}{\ell_y} < Y \le \tfrac{i+1}{\ell_y} \,\Big|\, P = k\right) \\
&= -\sum_{i=0}^{\ell_y - 1} \mathbb{P}(Q = i | P = k) \log(\mathbb{P}(Q = i | P = k)) \\
&= H(Q|P = k).
\end{align}

Therefore, $H(Q) = H(Q|P = k)$ for every $k$ where $0 \le k \le
\ell_x - 1$. Now since $H(Q|P) = \sum_k \mathbb{P}(P = k) H(Q|P = k)$,
we have $H(Q) = H(Q|P)$ and as a result U-MIC$(D) = 0$. ■


B. Noisy Setting

In this section we study the performance of the U-MIC in
the noisy setting. We first give a lower bound on it when the
two variables $x$ and $y$ have a noisy functional association in
which the noise is bounded. After that, we study the case of
unbounded noise.

For the bounded noise case, without loss of generality
we assume that $x \sim U[0, 1]$ and the noise has a uniform
distribution. Specifically, we assume that the sample points $(x_i, y_i)$
have the form $(x_i, h(x_i) + z_i)$ where $z_i \sim U[-\epsilon, \epsilon]$. We define
$y_{\mathrm{mid}} = (y_{\min} + y_{\max})/2$. In Algorithm 2 we divide the Y-axis into
two rows by drawing a horizontal line at $y_{\mathrm{mid}}$. In addition,
we divide the X-axis into $\ell_x$ columns, each having length
$\frac{1}{\ell_x}$ (since $x \sim U[0, 1]$). Let $D_1 = \{(x_i, y_i) \mid y_i \le y_{\mathrm{mid}}\}$ and
$D_2 = \{(x_i, y_i) \mid y_i > y_{\mathrm{mid}}\}$. We use $\mathcal{P}$ and $\mathcal{Q}$ to denote the
partitions of the X-axis and Y-axis of the grid in this setting. With
this setting and notation in mind, the following corollary
gives a simple lower bound for U-MIC$(D)$ in this case.

Corollary III.3. Let $m$ be the number of columns in $\mathcal{P}$ in
which there exists a sample point $(x, y)$ such that $|y - y_{\mathrm{mid}}| \le
\epsilon$. Then U-MIC$(D)$ is lower-bounded by

\[ \frac{|D| \log(|D|) - |D_1| \log(|D_1|) - |D_2| \log(|D_2|)}{|D|} - \frac{m}{\ell_x}. \]


Proof of Corollary III.3. Since $I(\mathcal{P}; \mathcal{Q}) = H(\mathcal{Q}) -
H(\mathcal{Q}|\mathcal{P})$, we need an upper bound on $H(\mathcal{Q}|\mathcal{P})$ in
order to determine a lower bound on $I(\mathcal{P}; \mathcal{Q})$. According to
the entropy definition we can write

\begin{align}
H(\mathcal{Q}) &= -\frac{|D_1|}{|D|} \log\left(\frac{|D_1|}{|D|}\right) - \frac{|D_2|}{|D|} \log\left(\frac{|D_2|}{|D|}\right) \tag{22} \\
&= \frac{|D| \log(|D|) - |D_1| \log(|D_1|) - |D_2| \log(|D_2|)}{|D|}.
\end{align}

Let $\mathcal{M}$ denote the set of columns in which there
exists a data point $(x, y)$ such that $|y - y_{\mathrm{mid}}| \le \epsilon$. Since $\mathcal{Q}$
has only two rows, we can upper-bound $H(\mathcal{Q}|\mathcal{P})$ as
follows:

\begin{align}
H(\mathcal{Q}|\mathcal{P}) &= \sum_{k=0}^{\ell_x - 1} \mathbb{P}(\mathcal{P} = k) H(\mathcal{Q}|\mathcal{P} = k) \tag{23} \\
&\overset{(a)}{=} \frac{1}{\ell_x} \left( \sum_{k \in \mathcal{M}} H(\mathcal{Q}|\mathcal{P} = k) + \sum_{k \notin \mathcal{M}} H(\mathcal{Q}|\mathcal{P} = k) \right) \\
&\overset{(b)}{=} \frac{1}{\ell_x} \sum_{k \in \mathcal{M}} H(\mathcal{Q}|\mathcal{P} = k) \\
&\le \frac{|\mathcal{M}|}{\ell_x} = \frac{m}{\ell_x},
\end{align}

where (a) holds since $x \sim U[0, 1]$ and (b) holds because $z_i \sim
U[-\epsilon, \epsilon]$. The lower bound is then derived by combining (22)
and (23). ■

The main issue with generalizing this lower-bounding idea
to other noise distributions is that noise values could be
unbounded. Hence, we use the idea of $k$-nearest neighbors
to bound the noise so as to come up with a consistent version
of the association detector. We study this idea for the case in which
the noise is drawn from a Gaussian distribution with zero mean and
variance $\sigma^2$.

For each sample point, we consider its $\delta_n$-neighborhood
(we use the subscript $n$ to show the dependency on the size of
the dataset $n$). We replace each data point with the average of
the sample points located in its $\delta_n$-neighborhood. The following
lemma characterizes the number of sample points in this
neighborhood.

Fig. 3. Using the $k$-nearest neighbors method to bound the noise in noisy
relationships. We replace each point with the average of its neighbors
($2\epsilon_n + 1$ points) in its $\delta_n$-neighborhood.

Lemma III.4. Let $x$ be uniformly distributed, i.e., $x \sim U[0, 1]$,
and let $(x_i, y_i)$ denote the $i$-th data point in $D$ where $y_i = h(x_i) +
z_i$. If $\mathcal{N} = \{(x_j, y_j) \mid (x_i - x_j)^2 \le \delta_n^2\}$, then $\lim_{n \to \infty} |\mathcal{N}| =
2n\delta_n$.

Proof of Lemma III.4. Let $\mathbb{I}(\cdot)$ denote the indicator
function. Then we can write

\[ 2\epsilon_n = |\mathcal{N}| = \sum_{j=1}^{n} \mathbb{I}(x_i - \delta_n \le x_j \le x_i + \delta_n). \tag{24} \]

As a result, $\mathbb{E}[2\epsilon_n] = 2n\delta_n$. Using the Hoeffding inequality
we have

\[ \mathbb{P}(|2\epsilon_n - \mathbb{E}[2\epsilon_n]| > t) \le 2 e^{-c t^2 / n} \tag{25} \]

for some constant $c$. For a suitable choice of $t$, it follows that
$\lim_{n \to \infty} \epsilon_n = n\delta_n$ or, equivalently, $\lim_{n \to \infty} |\mathcal{N}| = 2n\delta_n$. ■

Assume that $h(\cdot)$ is a Lipschitz continuous function of order
$\beta$, i.e., $|h(v) - h(w)| \le k |v - w|^{\beta}$, where $k$ is a constant that
depends on the function $h(\cdot)$. If we estimate (or replace) the
y-value of each noisy sample point with the average of the sample
points in its $\delta_n$-neighborhood, then in the case of Gaussian noise
(zero mean and variance $\sigma^2$) we can write the estimation mean
squared error as


1 

A„ = -y^E{h{xi) - h{xi)f 

2 e„ -I- 1 

< - - —I- 

“ 2tn + 1 


(26) 


2 


h{xi) 


In order to minimize the estimation error, we can take the
derivative with respect to $\epsilon_n$ and set it to 0. Therefore, the
$\epsilon_n^*$ which minimizes the mean squared error is


e 


4= 

n 



2^TT , 


(27) 


We use this $\epsilon_n^*$ later to bound the noise. The following
lemma gives a probabilistic bound on the noise values.

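Lemma III.5. If $z_1, z_2, \ldots, z_n$ are drawn i.i.d. from $N(0, \sigma^2)$,
then $\mathbb{P}\{\max_{1 \le i \le n} |z_i| > t\} \le 2n e^{-t^2/(2\sigma^2)}$.

Proof of Lemma III.5. First of all, for a zero-mean Gaussian
random variable $z_i$ we prove that $\mathbb{P}\{|z_i| > t\} \le 2 e^{-t^2/(2\sigma^2)}$.
Let $u = z_i - t$, and hence $u \sim N(-t, \sigma^2)$. We have

\[ \frac{1}{\sqrt{2\pi\sigma^2}} \int_0^{\infty} e^{-\frac{u^2}{2\sigma^2}} e^{-\frac{tu}{\sigma^2}}\, du \le \frac{1}{\sqrt{2\pi\sigma^2}} \int_0^{\infty} e^{-\frac{u^2}{2\sigma^2}}\, du \le 1. \tag{28} \]

As a result we can write

\[ \frac{1}{\sqrt{2\pi\sigma^2}} \int_t^{\infty} e^{-\frac{z_i^2}{2\sigma^2}}\, dz_i = \frac{1}{\sqrt{2\pi\sigma^2}} \int_0^{\infty} e^{-\frac{(u+t)^2}{2\sigma^2}}\, du \le e^{-\frac{t^2}{2\sigma^2}}. \tag{29} \]

Similarly, (29) holds for $(-\infty, -t]$ and hence $\mathbb{P}\{|z_i| > t\} \le
2 e^{-t^2/(2\sigma^2)}$. The result of the lemma then follows from applying
the union bound to the $z_i$'s. ■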
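The bound of Lemma III.5 can be checked numerically. The sketch below reads the bound as $2n e^{-t^2/(2\sigma^2)}$ and compares it with a Monte Carlo estimate; the function names, the parameter values, and the fixed seed are illustrative assumptions.

```python
import math
import random

def max_abs_tail_bound(n, sigma, t):
    """Union bound on P(max_i |z_i| > t): 2 n exp(-t^2 / (2 sigma^2))."""
    return 2 * n * math.exp(-t * t / (2 * sigma * sigma))

def empirical_max_abs_tail(n, sigma, t, trials, seed=0):
    """Monte Carlo estimate of P(max_i |z_i| > t) for n i.i.d. N(0, sigma^2) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if max(abs(rng.gauss(0.0, sigma)) for _ in range(n)) > t:
            hits += 1
    return hits / trials

n, sigma, t = 50, 1.0, 3.5
estimate = empirical_max_abs_tail(n, sigma, t, trials=2000)
bound = max_abs_tail_bound(n, sigma, t)
print(estimate, bound)  # the estimate should not exceed the bound
```

The union bound is loose for moderate $t$, but it decays exponentially in $t^2/\sigma^2$, which is what matters once the noise variance is reduced by averaging.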

By using the $k$-nearest neighbors method, each $z_i$ is
replaced by $\bar{z}_i$, which is the average of $2\epsilon_n + 1$ i.i.d. noise values,
and hence its variance is decreased by a factor of $2\epsilon_n + 1$. This idea
motivates the following corollary, which lets us bound the
noise.

Corollary III.6. By using the $k$-nearest neighbors method,
$\bar{z}_i \sim N\left(0, \frac{\sigma^2}{2\epsilon_n + 1}\right)$ and as a result
$\lim_{n \to \infty} \max_{1 \le i \le n} |\bar{z}_i| = 0$.

Proof of Corollary III.6. According to Lemma III.5
we can write $\mathbb{P}\{\max_{1 \le i \le n} |\bar{z}_i| > t\} \le 2n e^{-t^2 (2\epsilon_n + 1)/(2\sigma^2)}$. The
result then follows from a suitable choice of $t$ and from $\epsilon_n = \epsilon_n^*$, which
was derived in (27). ■

Fig. 4. Test functional relationships for Tables I, II, and III.

Fig. 5. Test non-functional relationships for Tables IV and V.
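The neighborhood-averaging step can be sketched as follows. The function name `neighborhood_average`, the choice $h(x) = x$, the evenly spaced x-values standing in for $x \sim U[0, 1]$, and the parameter values are all assumptions made for this illustration; the quadratic-time loop is chosen for clarity rather than speed.

```python
import random

def neighborhood_average(xs, ys, delta):
    """Replace each y_i by the mean of all y_j with |x_i - x_j| <= delta."""
    smoothed = []
    for xi in xs:
        neigh = [yj for xj, yj in zip(xs, ys) if abs(xi - xj) <= delta]
        smoothed.append(sum(neigh) / len(neigh))  # neigh always contains y_i itself
    return smoothed

rng = random.Random(1)
n, sigma, delta = 1000, 0.5, 0.05
xs = [i / (n - 1) for i in range(n)]   # evenly spread stand-in for x ~ U[0, 1]
clean = list(xs)                       # hypothetical h(x) = x
noisy = [h + rng.gauss(0.0, sigma) for h in clean]
smooth = neighborhood_average(xs, noisy, delta)

raw_err = max(abs(a - b) for a, b in zip(noisy, clean))
smooth_err = max(abs(a - b) for a, b in zip(smooth, clean))
print(raw_err, smooth_err)  # the largest deviation shrinks after averaging
```

Averaging roughly $2n\delta_n$ neighbors reduces the noise variance by about that factor, at the cost of a bias that grows with $\delta_n$, which is the trade-off behind the choice of $\epsilon_n^*$.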

In the next section we show how the U-MIC works in
practice compared to the MIC.

IV. Simulation Results 

In this section, we study the performance of our proposed
measure. We first show how it works for functional
associations. Second, we study its performance for non-functional
associations. Finally, we run some experiments for the case
of noisy relationships. As mentioned previously, the authors
in [11] apply a heuristic to compute the MIC, which may not
result in the true MIC. In contrast, we do not apply any
heuristic in the simulations, in order to have a precise
comparison with our proposed method.

Figure 4 shows the functional associations on which we have
tested the performance of the MIC and U-MIC algorithms.
Table I summarizes the results for the case of 200
sample points. One interesting point in Table I is the value of
the U-MIC for the sinusoidal function with different frequencies.
As we can see, MIC$(D) = 1$ while U-MIC$(D) = 0.75$ for this
function. One interpretation of this difference is that in the
proof of Proposition III.1 we assumed that the absolute
value of the derivative of the function $h(\cdot)$ is upper-bounded by a
constant $c$. However, this is not the case for the sinusoidal function
with different frequencies, since there is a discontinuity in this
function. If we increase the sample size, as reported in Table
II, this issue is alleviated.

TABLE I: MIC(D) and U-MIC(D) for different functional
relationships in Figure 4. For this set of experiments, |D| = 200.

         Linear   Parabolic   Periodic   Cubic   Sin (Diff. Freq.)   Sin (Single Freq.)
MIC      1        1           1                  1
U-MIC    1        1           0.93       0.95    0.75                0.91


TABLE II: MIC(D) and U-MIC(D) for different functional
relationships in Figure 4. For this set of experiments, |D| = 500.

         Linear   Parabolic   Periodic   Cubic   Sin (Diff. Freq.)   Sin (Single Freq.)
MIC      1        1           1                  1
U-MIC    1        1           0.99       0.99    0.93                0.95


TABLE III: Run time (in sec.) for calculation of MIC(D) and
U-MIC(D) for different functional relationships in Figure 4.
For this set of experiments, |D| = 200.

         Linear   Parabolic   Periodic   Cubic   Sin (Diff. Freq.)   Sin (Single Freq.)
MIC      0.1      0.5         0.1        0.2     2                   0.4
U-MIC    0.01     0.01        0.01       0.01    0.01                0.01

Although the same issue holds for the periodic function in
Figure 4, we do not see as much effect. Qualitatively, the
derivative of the continuous pieces of the periodic function in
Figure 4 ($y = x$) is smaller than the maximum of the
derivative of the sinusoidal function with different frequencies
($y = \sin(10x)$, $y = \sin(20x)$). Hence, if we uniformly
partition the X-axis in the case of the periodic function, there
would be fewer sample points in the rows of a given column and
more probably higher entropy (resulting in a higher U-MIC), as
is the case in Table I.

Table III summarizes the runtime for calculating the MIC
and U-MIC for the different functional associations in Figure 4. As
we can see, the U-MIC is at least 10 times faster in these cases.
This is expected, since the MIC uses dynamic programming to
find a close-to-optimal grid for the data while the U-MIC just
uniformly partitions the axes.

TABLE IV: MIC(D) and U-MIC(D) for different non-functional
relationships in Figure 5. For this set of experiments,
|D| = 200.

         Circle   Sinusoidal Mixture   Two Lines   Random
MIC      0.68     0.72                 0.71        0.16
U-MIC    0.64     0.69                 0.68        0.06

Table IV summarizes the results for the non-functional
associations presented in Figure 5. One important point about Table
IV is that the U-MIC has a better performance in the case of
random sample points (i.e., $x \perp y$). In this case, the ideal MIC
and U-MIC is 0; however, as we can see, MIC$(D) = 0.16$ and
U-MIC$(D) = 0.06$. This issue is related to one of the criticisms
made about the MIC in the literature. One of the drawbacks
of the MIC is that, as a statistical test, it has lower
power than other measures of dependency such as distance
correlation. In other words, it gives more false positives in
detecting associations. However, according to our simulation
results and Proposition III.2, this issue is alleviated in the
U-MIC.

TABLE V: Run time (in sec.) for calculation of MIC(D) and
U-MIC(D) for different non-functional relationships in Figure
5. For this set of experiments, |D| = 200.

         Circle   Sinusoidal Mixture   Two Lines   Random
MIC      26.38    13.41                16.60       61.00
U-MIC    0.01     0.02                 0.01        0.02

Table V shows the runtime for calculating the MIC and
U-MIC. In the case of non-functional relationships, there are
more clumps in the initial grid of sample points for the calculation
of the MIC. Hence, Algorithm 1, which essentially runs
dynamic programming over the initial grid to find the optimal
grid, has a larger runtime, as we can see in Table V.
On the other hand, since the U-MIC relies on uniform
partitioning of the grid of sample points, the type of
relationship between the two random variables does not matter: the
runtime is almost constant and similar to the cases in which there
is a functional association between the two variables.

Tables VI and VII summarize the results for the noisy
non-functional associations presented in Figure 6. Figure 6 is
similar to Figure 5 except that we have added
noise drawn from a uniform distribution, i.e., $U[-0.5, 0.5]$, to the
sample points. Comparing Table VI with Table IV, we can see
that the range of decrease for different associations is almost
the same for both the MIC and the U-MIC. We expected this for
the MIC since it has an important property called equitability [11].
On the other hand, we can observe that, at least according to
the simulation results reported here, the U-MIC has approximately
the same equitability property.

V. Conclusion 

In this paper we introduced a novel measure of dependency
between two variables. This measure is called the uniform
maximal information coefficient (U-MIC) because it is a
modification of the original MIC [11]. It is derived from uniform
partitioning of both the X and Y axes. Therefore, unlike the MIC,
it does not rely on dynamic programming and hence
is much faster. We proved that, asymptotically, the
U-MIC equals 1 if there is a functional relationship between
the two variables. If the two variables are truly independent of each
other, then we showed that the U-MIC equals 0.
In particular, according to the simulation results,
the U-MIC does a better job of recognizing independence
between variables than the MIC.



Fig. 6. Test noisy non-functional relationships for Tables VI and VII.

TABLE VI: MIC(D) and U-MIC(D) for different non-functional
relationships in Figure 5. For this set of experiments,
|D| = 200 and noise is uniformly distributed in [-0.05, 0.05].

         Circle   Sinusoidal Mixture   Two Lines
MIC      0.54     0.60                 0.57
U-MIC    0.52     0.48                 0.54

TABLE VII: Run time (in sec.) for calculation of MIC(D) and
U-MIC(D) for different noisy non-functional relationships in
Figure 6. For this set of experiments, |D| = 200 and noise is
uniformly distributed in [-0.05, 0.05].

         Circle   Sinusoidal Mixture   Two Lines
MIC      35       16                   27
U-MIC    0.01     0.02                 0.01